Goto

Collaborating Authors

 judge system


Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

arXiv.org Artificial Intelligence

The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.


AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents

arXiv.org Artificial Intelligence

Accurate and systematic evaluation of mobile agents can significantly advance their development and real-world applicability. However, existing benchmarks for mobile agents lack practicality and scalability due to the extensive manual effort required to define task reward signals and implement corresponding evaluation codes. To this end, we propose AutoEval, an autonomous agent evaluation framework that tests a mobile agent without any manual effort. First, we design a Structured Substate Representation to describe the UI state changes while agent execution, such that task reward signals can be automatically generated. Second, we utilize a Judge System that can autonomously evaluate agents' performance given the automatically generated task reward signals. By providing only a task description, our framework evaluates agents with fine-grained performance feedback to that task without any extra manual effort. We implement a prototype of our framework and validate the automatically generated task reward signals, finding over 93% coverage to human-annotated reward signals. Moreover, to prove the effectiveness of our autonomous Judge System, we manually verify its judge results and demonstrate that it achieves 94% accuracy. Finally, we evaluate the state-of-the-art mobile agents using our framework, providing detailed insights into their performance characteristics and limitations.


Verdict: A Library for Scaling Judge-Time Compute

arXiv.org Artificial Intelligence

The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units -- such as verification, debate, and aggregation -- and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve state-of-the-art (SOTA) or near-SOTA performance, surpassing orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Ultimately, we hope Verdict serves as a useful framework for researchers and practitioners building scalable, interpretable, and reliable LLM-based evaluators.


A GPT-based Code Review System for Programming Language Learning

arXiv.org Artificial Intelligence

The increasing demand for programming language education and growing class sizes require immediate and personalized feedback. However, traditional code review methods have limitations in providing this level of feedback. As the capabilities of Large Language Models (LLMs) like GPT for generating accurate solutions and timely code reviews are verified, this research proposes a system that employs GPT-4 to offer learner-friendly code reviews and minimize the risk of AI-assist cheating. To provide learner-friendly code reviews, a dataset was collected from an online judge system, and this dataset was utilized to develop and enhance the system's prompts. In addition, to minimize AI-assist cheating, the system flow was designed to provide code reviews only for code submitted by a learner, and a feature that highlights code lines to fix was added. After the initial system was deployed on the web, software education experts conducted usability test. Based on the results, improvement strategies were developed to improve code review and code correctness check module, thereby enhancing the system. The improved system underwent evaluation by software education experts based on four criteria: strict code correctness checks, response time, lower API call costs, and the quality of code reviews. The results demonstrated a performance to accurately identify error types, shorten response times, lower API call costs, and maintain high-quality code reviews without major issues. Feedback from participants affirmed the tool's suitability for teaching programming to primary and secondary school students. Given these benefits, the system is anticipated to be a efficient learning tool in programming language learning for educational settings.